options(repos = c(CRAN = "https://cloud.r-project.org"))
## the datatable function from the DT package creates an HTML widget display of the dataset
## install DT package if the package is not yet available in your R environment
readxl::read_excel("dataset/dataset-variable-description.xlsx") |>
  DT::datatable()
BCon 147: special topics
1 Project overview
In this project, we will explore employee attrition and performance using the HR Analytics Employee Attrition & Performance dataset. The primary goal is to develop insights into the factors that contribute to employee attrition. By analyzing a range of factors, including demographic data, job satisfaction, work-life balance, and job role, we aim to help businesses identify key areas where they can improve employee retention.
2 Scenario
Imagine you are working as a data analyst for a mid-sized company that is experiencing high employee turnover, especially among high-performing employees. The company has been facing increased costs related to hiring and training new employees, and management is concerned about the negative impact on productivity and morale. The human resources (HR) team has collected historical employee data and now looks to you for actionable insights. They want to understand why employees are leaving and how to retain talent effectively.
Your task is to analyze the dataset and provide insights that will help HR prioritize retention strategies. These strategies could include interventions like revising compensation policies, improving job satisfaction, or focusing on work-life balance initiatives. The success of your analysis could lead to significant cost savings for the company and an increase in employee engagement and performance.
3 Understanding data source
The dataset used for this project provides information about employee demographics, performance metrics, and various satisfaction ratings. The dataset is particularly useful for exploring how factors such as job satisfaction, work-life balance, and training opportunities influence employee performance and attrition.
This dataset is well-suited for conducting in-depth analysis of employee performance and retention, enabling us to build predictive models that identify the key drivers of employee attrition. Additionally, we can assess the impact of various organizational factors, such as training and work-life balance, on both performance and retention outcomes.
4 Data wrangling and management
Libraries
Before we start working on the dataset, we need to load the libraries that will be used for data wrangling, analysis, and visualization. Load the following libraries here. To install a missing package, use the install.packages function. Some packages are installed later in this project, so install them as needed and load them here.
# load all your libraries here
if (!require(magrittr)) install.packages("magrittr");library(magrittr)
if (!require(dplyr)) install.packages("dplyr");library(dplyr)
if (!require(tidyverse)) install.packages("tidyverse"); library(tidyverse)
if (!require(ggplot2)) install.packages("ggplot2"); library(ggplot2)
if (!require(readr)) install.packages("readr"); library(readr)
if (!require(DT)) install.packages("DT"); library(DT)
if (!require(janitor)) install.packages("janitor");library(janitor)
if (!require(GGally)) install.packages("GGally"); library(GGally)
if (!require(broom)) install.packages("broom"); library(broom)
if (!require(parameters)) install.packages("parameters"); library(parameters)
if (!require(knitr)) install.packages("knitr"); library(knitr)
if (!require(scales)) install.packages("scales"); library(scales)
4.1 Data importation
Import the two datasets, Employee.csv and PerformanceRating.csv. Save Employee.csv as employee_dta and PerformanceRating.csv as perf_rating_dta.
Merge the two datasets using the left_join function from dplyr. Use the EmployeeID variable as the variable to join by. You may read more about the left_join function in the dplyr documentation.
Save the merged dataset as hr_perf_dta and display the dataset using the datatable function from the DT package.
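Before running the join on the real files, it can help to see what left_join does on a small example. The sketch below uses two toy tables with hypothetical values (the names employee_toy and rating_toy are illustrative and not part of the project data); only the EmployeeID key is taken from the instructions above.

```r
library(dplyr)

# Toy stand-ins for the two files (hypothetical values, not the real data)
employee_toy <- tibble(
  EmployeeID = c("E1", "E2", "E3"),
  Department = c("Sales", "Technology", "Human Resources")
)
rating_toy <- tibble(
  EmployeeID    = c("E1", "E1", "E3"),
  ManagerRating = c(3, 4, 5)
)

# left_join keeps every row of the left table: an employee without a review
# gets NA, and an employee with several reviews appears once per review
joined_toy <- left_join(employee_toy, rating_toy, by = "EmployeeID")
print(joined_toy)
```

In this toy case the result has four rows: two for E1, one for E2 with a missing rating, and one for E3. The same behavior explains why the merged hr_perf_dta can have more rows than the employee table.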
## import the two data here
## paths are relative to the project folder (midterm-bcon147-project-exercise)
employee_dta <- read.csv("dataset/Employee.csv")
perf_rating_dta <- read.csv("dataset/PerformanceRating.csv")
## merge employee_dta and perf_rating_dta using left_join function.
## save the merged dataset as hr_perf_dta
hr_perf_dta <- left_join(employee_dta, perf_rating_dta, by = "EmployeeID")
## Use the datatable from DT package to display the merged dataset
datatable(hr_perf_dta)
4.2 Data management
Using the clean_names function from the janitor package, standardize the variable names by using the recommended naming of variables.
Save the renamed variables as hr_perf_dta to update the dataset.
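As a quick illustration of what clean_names does, the sketch below applies it to a one-row toy data frame. The column values are hypothetical, and the "Hire Date" spelling with a space is an assumption made only to show how spaces are handled.

```r
library(janitor)

# Toy data frame with mixed-case names (values are hypothetical)
toy <- data.frame(
  EmployeeID  = 1,
  JobRole     = "Recruiter",
  `Hire Date` = "2020-01-01",
  check.names = FALSE
)

# clean_names converts the names to lower snake_case
toy_clean <- clean_names(toy)
names(toy_clean)
```

The cleaned names follow the same pattern as the column list printed later in this section, for example employee_id and job_role.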
## clean names using the janitor packages and save as hr_perf_dta
hr_perf_dta <- clean_names(hr_perf_dta)
## display the renamed hr_perf_dta using datatable function
datatable(hr_perf_dta)
Create a new variable cat_education wherein education is 1 = No formal education; 2 = High school; 3 = Bachelor; 4 = Masters; 5 = Doctorate. Use the case_when function to accomplish this task.
Similarly, create new variables cat_envi_sat, cat_job_sat, and cat_relation_sat for environment_satisfaction, job_satisfaction, and relationship_satisfaction, respectively. Re-code the values accordingly as 1 = Very dissatisfied; 2 = Dissatisfied; 3 = Neutral; 4 = Satisfied; and 5 = Very satisfied.
Create new variables cat_work_life_balance, cat_self_rating, and cat_manager_rating for work_life_balance, self_rating, and manager_rating, respectively. Re-code accordingly as 1 = Unacceptable; 2 = Needs improvement; 3 = Meets expectation; 4 = Exceeds expectation; and 5 = Above and beyond.
Create a new variable bi_attrition by transforming the attrition variable into a numeric variable. Re-code accordingly as No = 0 and Yes = 1.
Save all the changes in hr_perf_dta. Note that saving the changes with the same name will update the dataset with the new variables created.
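Because several variables share the same five-point scale, the recoding can also be written once and applied to many columns. The sketch below is an alternative to case_when, not the method the task asks for: it uses a label vector with match() inside across() on toy data with hypothetical values. Note that it produces names such as cat_job_satisfaction rather than the shorter cat_job_sat requested above.

```r
library(dplyr)

# Labels for the satisfaction scale described above, in code order 1 to 5
sat_labels <- c("Very dissatisfied", "Dissatisfied", "Neutral",
                "Satisfied", "Very satisfied")

# Toy data (hypothetical values, including an NA and an out-of-range code)
toy <- tibble(
  job_satisfaction         = c(1, 3, 5, NA),
  environment_satisfaction = c(2, 4, 9, 1)
)

toy <- toy %>%
  mutate(across(
    c(job_satisfaction, environment_satisfaction),
    ~ sat_labels[match(.x, 1:5)],  # codes outside 1 to 5 (and NA) become NA
    .names = "cat_{.col}"
  ))
print(toy)
```

The match() lookup plays the role of the final TRUE ~ NA_character_ branch in case_when: any value that is not a valid code is left missing instead of being mislabeled.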
## create cat_education
colnames(hr_perf_dta)
[1] "employee_id" "first_name"
[3] "last_name" "gender"
[5] "age" "business_travel"
[7] "department" "distance_from_home_km"
[9] "state" "ethnicity"
[11] "education" "education_field"
[13] "job_role" "marital_status"
[15] "salary" "stock_option_level"
[17] "over_time" "hire_date"
[19] "attrition" "years_at_company"
[21] "years_in_most_recent_role" "years_since_last_promotion"
[23] "years_with_curr_manager" "performance_id"
[25] "review_date" "environment_satisfaction"
[27] "job_satisfaction" "relationship_satisfaction"
[29] "training_opportunities_within_year" "training_opportunities_taken"
[31] "work_life_balance" "self_rating"
[33] "manager_rating"
hr_perf_dta <- hr_perf_dta %>%
  mutate(cat_education = case_when(
    education == 1 ~ "No formal education",
    education == 2 ~ "High school",
    education == 3 ~ "Bachelor",
    education == 4 ~ "Masters",
    education == 5 ~ "Doctorate",
    TRUE ~ NA_character_
  ))
## create cat_envi_sat, cat_job_sat, and cat_relation_sat
hr_perf_dta <- hr_perf_dta %>% mutate(cat_envi_sat = case_when(
environment_satisfaction == 1 ~ "Very dissatisfied",
environment_satisfaction == 2 ~ "Dissatisfied",
environment_satisfaction == 3 ~ "Neutral",
environment_satisfaction == 4 ~ "Satisfied",
environment_satisfaction == 5 ~ "Very satisfied",
TRUE ~ NA_character_
)) %>%
# Recode job satisfaction
mutate(cat_job_sat = case_when(
job_satisfaction == 1 ~ "Very dissatisfied",
job_satisfaction == 2 ~ "Dissatisfied",
job_satisfaction == 3 ~ "Neutral",
job_satisfaction == 4 ~ "Satisfied",
job_satisfaction == 5 ~ "Very satisfied",
TRUE ~ NA_character_
)) %>%
# Recode relationship satisfaction
mutate(cat_relation_sat = case_when(
relationship_satisfaction == 1 ~ "Very dissatisfied",
relationship_satisfaction == 2 ~ "Dissatisfied",
relationship_satisfaction == 3 ~ "Neutral",
relationship_satisfaction == 4 ~ "Satisfied",
relationship_satisfaction == 5 ~ "Very satisfied",
TRUE ~ NA_character_))
datatable(hr_perf_dta)
## create cat_work_life_balance, cat_self_rating, and cat_manager_rating
hr_perf_dta <- hr_perf_dta %>% mutate(cat_work_life_balance = case_when(
work_life_balance == 1 ~ "Unacceptable",
work_life_balance == 2 ~ "Needs improvement",
work_life_balance == 3 ~ "Meets expectation",
work_life_balance == 4 ~ "Exceeds expectation",
work_life_balance == 5 ~ "Above and beyond",
TRUE ~ NA_character_
)) %>%
# Recode self-rating
mutate(cat_self_rating = case_when(
self_rating == 1 ~ "Unacceptable",
self_rating == 2 ~ "Needs improvement",
self_rating == 3 ~ "Meets expectation",
self_rating == 4 ~ "Exceeds expectation",
self_rating == 5 ~ "Above and beyond",
TRUE ~ NA_character_
)) %>%
# Recode manager rating
mutate(cat_manager_rating = case_when(
manager_rating == 1 ~ "Unacceptable",
manager_rating == 2 ~ "Needs improvement",
manager_rating == 3 ~ "Meets expectation",
manager_rating == 4 ~ "Exceeds expectation",
manager_rating == 5 ~ "Above and beyond",
TRUE ~ NA_character_
))
datatable(hr_perf_dta)
## create bi_attrition
hr_perf_dta <- hr_perf_dta %>%
mutate(bi_attrition = if_else(attrition == "Yes", 1, 0))
datatable(hr_perf_dta)
## print the updated hr_perf_dta using datatable function
datatable(hr_perf_dta)
5 Exploratory data analysis
5.1 Descriptive statistics of employee attrition
Select the variables attrition, job_role, department, age, salary, job_satisfaction, and work_life_balance. Save as attrition_key_var_dta.
Compute and plot the attrition rate across job_role, department, age, salary, job_satisfaction, and work_life_balance. To compute the attrition rate, group the dataset by job role. Afterward, you can use the count function to get the frequency of attrition for each job role and then divide it by the total number of observations. Save the computation as pct_attrition. Do not forget to ungroup before storing the output. Store the output as attrition_rate_job_role.
The plot for the attrition rate across job_role has been done for you! Study each line of code. You have the freedom to customize your plot accordingly. Show your creativity!
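The same group, count, and divide steps are repeated for every grouping variable, so they can be wrapped in a small function. The helper name attrition_rate_by is hypothetical and the toy data are made up; the sketch assumes a numeric bi_attrition column coded 0/1 as created in section 4.2.

```r
library(dplyr)

# Hypothetical helper: attrition rate (%) by any grouping column
attrition_rate_by <- function(data, group_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise(
      total_employees = n(),
      total_attrition = sum(bi_attrition == 1, na.rm = TRUE),
      pct_attrition   = total_attrition / total_employees * 100,
      .groups = "drop"   # return an ungrouped tibble
    )
}

# Toy data (hypothetical values)
toy <- tibble(
  job_role     = c("Recruiter", "Recruiter", "Manager", "Manager", "Manager", "Manager"),
  bi_attrition = c(1, 0, 0, 0, 1, 0)
)

toy_rate <- attrition_rate_by(toy, job_role)
print(toy_rate)
```

With the toy data, one of two recruiters and one of four managers left, so the rates are 50% and 25%. On the project data the same call could be reused with department or any binned variable in place of job_role.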
## selecting attrition key variables and save as `attrition_key_var_dta`
attrition_key_var_dta <- hr_perf_dta %>% select(bi_attrition, job_role, department, age, salary, cat_job_sat,cat_work_life_balance)
## compute the attrition rate across job_role and save as attrition_rate_job_role
attrition_rate_job_role <- attrition_key_var_dta %>%
group_by(job_role) %>%
summarise(
total_employees = n(),
total_attrition = sum(bi_attrition == 1, na.rm = TRUE),
pct_attrition = (total_attrition / total_employees) * 100
) %>%
ungroup()
## print attrition_rate_job_role
print(attrition_rate_job_role)
# A tibble: 13 × 4
   job_role                  total_employees total_attrition pct_attrition
   <chr>                               <int>           <int>         <dbl>
 1 Analytics Manager                     213              28         13.1
 2 Data Scientist                       1387             597         43.0
 3 Engineering Manager                   307              18          5.86
 4 HR Business Partner                    25               0          0
 5 HR Executive                          119              29         24.4
 6 HR Manager                             17               0          0
 7 Machine Learning Engineer             582              95         16.3
 8 Manager                               145              19         13.1
 9 Recruiter                             152              86         56.6
10 Sales Executive                      1567             543         34.7
11 Sales Representative                  500             317         63.4
12 Senior Software Engineer              512              84         16.4
13 Software Engineer                    1373             445         32.4
## Plot the attrition rate
ggplot(attrition_rate_job_role, aes(x = reorder(job_role, pct_attrition), y = pct_attrition)) +
geom_bar(stat = "identity", fill = "lightgreen", color = "darkgreen") +
geom_text(aes(label = round(pct_attrition, 1)), vjust = -0.5, size = 3.5) +
labs(title = "Attrition Rate by Job Role",
x = "Job Role",
y = "Attrition Rate (%)") +
ylim(0, 80) + # Set y-axis limit
theme_bw() + # Using a theme
theme(axis.text.x = element_text(angle = 45, hjust = 1, face = "bold", color = "darkblue"), # Rotate and style x-axis labels
plot.title = element_text(hjust = 0.5, face = "bold", color = "darkred"),
plot.margin = unit(c(1, 1, 1, 1.5), "cm"))
## Compute attrition rate by department
attrition_rate_department <- attrition_key_var_dta %>%
group_by(department) %>%
summarise(
total_employees = n(),
total_attrition = sum(bi_attrition == 1, na.rm = TRUE),
pct_attrition = (total_attrition / total_employees) * 100
) %>%
ungroup()
## Plot attrition rate by department
ggplot(attrition_rate_department, aes(x = reorder(department, pct_attrition), y = pct_attrition)) +
geom_bar(stat = "identity", fill = "lightblue", color = "darkblue") +
geom_text(aes(label = round(pct_attrition, 1)), vjust = -0.5, size = 3.5) +
labs(title = "Attrition Rate by Department",
x = "Department",
y = "Attrition Rate (%)") +
ylim(0, 80) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, face = "bold", color = "darkblue"),
plot.title = element_text(hjust = 0.5, face = "bold", color = "darkred"),
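plot.margin = unit(c(1, 1, 1, 1.5), "cm"))
## Compute attrition rate by age group (you may want to bin ages into groups)
## right = FALSE makes the bins left-closed, so [20, 30) matches the label "20-29"
## ages outside the 20-69 range fall outside the breaks and become NA
attrition_rate_age <- attrition_key_var_dta %>%
mutate(age_group = cut(age, breaks = c(20, 30, 40, 50, 60, 70),
labels = c("20-29", "30-39", "40-49", "50-59", "60-69"),
right = FALSE)) %>%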
group_by(age_group) %>%
summarise(
total_employees = n(),
total_attrition = sum(bi_attrition == 1, na.rm = TRUE),
pct_attrition = (total_attrition / total_employees) * 100
) %>%
ungroup()
## Plot attrition rate by age group
ggplot(attrition_rate_age, aes(x = age_group, y = pct_attrition)) +
geom_bar(stat = "identity", fill = "lightcoral", color = "darkred") +
geom_text(aes(label = round(pct_attrition, 1)), vjust = -0.5, size = 3.5) +
labs(title = "Attrition Rate by Age Group",
x = "Age Group",
y = "Attrition Rate (%)") +
ylim(0, 80) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, face = "bold", color = "darkblue"),
plot.title = element_text(hjust = 0.5, face = "bold", color = "darkred"),
plot.margin = unit(c(1, 1, 1, 1.5), "cm"))
## Compute attrition rate by salary (you may want to bin salaries into ranges)
## the top break is Inf so that salaries above 200,000 land in "150k+" instead of NA
attrition_rate_salary <- attrition_key_var_dta %>%
mutate(salary_range = cut(salary, breaks = c(0, 50000, 100000, 150000, Inf),
labels = c("0-50k", "50k-100k", "100k-150k", "150k+"))) %>%
group_by(salary_range) %>%
summarise(
total_employees = n(),
total_attrition = sum(bi_attrition == 1, na.rm = TRUE),
pct_attrition = (total_attrition / total_employees) * 100
) %>%
ungroup()
## Plot attrition rate by salary range
ggplot(attrition_rate_salary, aes(x = salary_range, y = pct_attrition)) +
geom_bar(stat = "identity", fill = "lightyellow", color = "orange") +
geom_text(aes(label = round(pct_attrition, 1)), vjust = -0.5, size = 3.5) +
labs(title = "Attrition Rate by Salary Range",
x = "Salary Range",
y = "Attrition Rate (%)") +
ylim(0, 80) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, face = "bold", color = "darkblue"),
plot.title = element_text(hjust = 0.5, face = "bold", color = "darkred"),
plot.margin = unit(c(1, 1, 1, 1.5), "cm"))
## Compute attrition rate by job satisfaction
attrition_rate_job_sat <- attrition_key_var_dta %>%
group_by(cat_job_sat) %>%
summarise(
total_employees = n(),
total_attrition = sum(bi_attrition == 1, na.rm = TRUE),
pct_attrition = (total_attrition / total_employees) * 100
) %>%
ungroup()
## Plot attrition rate by job satisfaction
ggplot(attrition_rate_job_sat, aes(x = reorder(cat_job_sat, pct_attrition), y = pct_attrition)) +
geom_bar(stat = "identity", fill = "lightpink", color = "darkred") +
geom_text(aes(label = round(pct_attrition, 1)), vjust = -0.5, size = 3.5) +
labs(title = "Attrition Rate by Job Satisfaction",
x = "Job Satisfaction",
y = "Attrition Rate (%)") +
ylim(0, 80) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, face = "bold", color = "darkblue"),
plot.title = element_text(hjust = 0.5, face = "bold", color = "darkred"),
plot.margin = unit(c(1, 1, 1, 1.5), "cm"))
## Compute attrition rate by work-life balance
attrition_rate_work_life <- attrition_key_var_dta %>%
group_by(cat_work_life_balance) %>%
summarise(
total_employees = n(),
total_attrition = sum(bi_attrition == 1, na.rm = TRUE),
pct_attrition = (total_attrition / total_employees) * 100
) %>%
ungroup()
## Plot attrition rate by work-life balance
ggplot(attrition_rate_work_life, aes(x = reorder(cat_work_life_balance, pct_attrition), y = pct_attrition)) +
geom_bar(stat = "identity", fill = "purple", color = "magenta") +
geom_text(aes(label = round(pct_attrition, 1)), vjust = -0.5, size = 3.5) +
labs(title = "Attrition Rate by Work-Life Balance",
x = "Work-Life Balance",
y = "Attrition Rate (%)") +
ylim(0, 80) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, face = "bold", color = "darkblue"),
plot.title = element_text(hjust = 0.5, face = "bold", color = "darkred"),
plot.margin = unit(c(1, 1, 1, 1.5), "cm"))
5.2 Identifying attrition key drivers using correlation analysis
Conduct a correlation analysis of key variables: bi_attrition, salary, years_at_company, job_satisfaction, manager_rating, and work_life_balance. Use the cor() function to run the correlation analysis. Remove missing values using na.omit() before running the correlation analysis. Save the output in hr_corr.
Use a correlation matrix or heatmap to visualize the relationship between these variables and attrition. You can use the GGally package and its ggcorr function to visualize the correlation heatmap. You may explore the ggcorr documentation for more information.
Discuss which factors seem most correlated with attrition and what that suggests about why employees are leaving.
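The task mentions ggcorr, so here is a minimal sketch of that route. The data are simulated with made-up parameters (they are not the HR data), and the styling choices are illustrative; ggcorr computes the correlations itself from a data frame of numeric columns.

```r
library(GGally)

# Simulated numeric data (hypothetical, for illustration only)
set.seed(147)
toy <- data.frame(
  bi_attrition     = rbinom(200, 1, 0.3),
  salary           = rnorm(200, mean = 100000, sd = 30000),
  years_at_company = rpois(200, lambda = 5)
)

# Pairwise-complete Pearson correlations drawn as a labelled heatmap
p_corr <- ggcorr(
  toy,
  method      = c("pairwise", "pearson"),
  label       = TRUE,   # print the coefficient in each tile
  label_round = 2
)
p_corr
```

ggcorr returns a ggplot object, so titles and themes can be added with the usual + syntax. On the project data, the numeric key variables would replace the simulated data frame.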
## conduct correlation of key variables.
key_vars <- hr_perf_dta %>%
select(bi_attrition, salary, years_at_company, cat_job_sat, manager_rating, cat_work_life_balance) %>%
mutate(
cat_job_sat = as.numeric(factor(cat_job_sat, levels = c("Very dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very satisfied"))),
cat_work_life_balance = as.numeric(factor(cat_work_life_balance, levels = c("Unacceptable", "Needs improvement", "Meets expectation", "Exceeds expectation", "Above and beyond")))
) %>%
na.omit()
hr_corr <- cor(key_vars )
## print hr_corr
print(hr_corr)
                      bi_attrition       salary years_at_company   cat_job_sat
bi_attrition           1.000000000 -0.211181478    -0.6896527798  0.0132368129
salary                -0.211181478  1.000000000     0.2206442116  0.0053054850
years_at_company      -0.689652780  0.220644212     1.0000000000  0.0008700583
cat_job_sat            0.013236813  0.005305485     0.0008700583  1.0000000000
manager_rating        -0.007654429 -0.001596736     0.0178656879 -0.0158205481
cat_work_life_balance  0.003428836 -0.001517145     0.0079339508  0.0417242942
                      manager_rating cat_work_life_balance
bi_attrition            -0.007654429           0.003428836
salary                  -0.001596736          -0.001517145
years_at_company         0.017865688           0.007933951
cat_job_sat             -0.015820548           0.041724294
manager_rating           1.000000000           0.007996938
cat_work_life_balance    0.007996938           1.000000000
## install GGally package and use ggcorr function to visualize the correlation
if(!require(corrplot)) install.packages("corrplot")
library(corrplot)
if(!require(GGally)) install.packages("GGally"); library(GGally)
if(!require(reshape2)) install.packages("reshape2"); library(reshape2)
# Plot the correlation matrix
corr_matrix <- cor(key_vars, use = "complete.obs")
melted_corr <- melt(corr_matrix)
ggplot(data = melted_corr, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "red", high = "green", mid = "white",
midpoint = 0, limit = c(-1,1), name="Correlation") +
geom_text(aes(Var1, Var2, label = round(value, 2)), color = "black") +
ggtitle("Correlation Heatmap of Key Variables") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold", color = "darkblue"),
axis.text.y = element_text(hjust = 1, vjust = 0.5),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1))
The correlation analysis shows that years at the company and salary are the variables most strongly correlated with employee attrition. Employees with longer tenure are less likely to leave, as seen in the strong negative correlation between attrition and years_at_company (-0.690), while those earning lower salaries are more likely to leave, indicated by the weaker negative correlation between attrition and salary (-0.211). Interestingly, job satisfaction (0.013), manager rating (-0.008), and work-life balance (0.003) show almost no linear association with attrition, suggesting these factors are less important for retention than tenure and compensation. Additionally, salary tends to increase with tenure (0.221), hinting at a structured pay system. Overall, companies should focus on retaining employees through competitive salaries and promoting long-term commitment, as these are more directly tied to reducing turnover than factors like job satisfaction or manager performance.
Create a logistic regression model to predict employee attrition using the following variables: salary, years_at_company, job_satisfaction, manager_rating, and work_life_balance. Save the model as hr_attrition_glm_model. Print the summary of the model using the summary function.
Install the sjPlot package and use the tab_model function to display the summary of the model. You may read the sjPlot documentation on how to customize your model summary.
Also, use the plot_model function to visualize the model coefficients. You may read the sjPlot documentation on how to customize your model visualization.
Discuss the results of the logistic regression model and what they suggest about the factors that contribute to employee attrition.
## run a logistic regression model to predict employee attrition
## save the model as hr_attrition_glm_model
hr_attrition_glm_model <- glm(
bi_attrition ~ salary + years_at_company + cat_job_sat + manager_rating + cat_work_life_balance,
data = key_vars,
family = binomial() )
## print the summary of the model using the summary function
summary(hr_attrition_glm_model)
Call:
glm(formula = bi_attrition ~ salary + years_at_company + cat_job_sat +
manager_rating + cat_work_life_balance, family = binomial(),
data = key_vars)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.571e+00 2.173e-01 11.831 <2e-16 ***
salary -3.633e-06 4.086e-07 -8.893 <2e-16 ***
years_at_company -6.333e-01 1.476e-02 -42.919 <2e-16 ***
cat_job_sat 3.470e-02 3.186e-02 1.089 0.276
manager_rating 5.071e-03 3.810e-02 0.133 0.894
cat_work_life_balance 2.587e-02 3.198e-02 0.809 0.419
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8574.5 on 6708 degrees of freedom
Residual deviance: 4781.6 on 6703 degrees of freedom
AIC: 4793.6
Number of Fisher Scoring iterations: 5
## install sjPlot package and use tab_model function to display the summary of the model
if(!require(sjPlot)) install.packages("sjPlot"); library(sjPlot)
tab_model(hr_attrition_glm_model)

Dependent variable: bi attrition

| Predictors | Odds Ratios | CI | p |
|---|---|---|---|
| (Intercept) | 13.08 | 8.56 – 20.07 | <0.001 |
| salary | 1.00 | 1.00 – 1.00 | <0.001 |
| years at company | 0.53 | 0.52 – 0.55 | <0.001 |
| cat job sat | 1.04 | 0.97 – 1.10 | 0.276 |
| manager rating | 1.01 | 0.93 – 1.08 | 0.894 |
| cat work life balance | 1.03 | 0.96 – 1.09 | 0.419 |
| Observations | 6709 | | |
| R2 Tjur | 0.502 | | |
## use plot_model function to visualize the model coefficients
# Aligning the plot_model design
plot_model(hr_attrition_glm_model,
type = "est",
show.values = TRUE,
value.offset = 0.3,
title = "Model Coefficients for Employee Attrition") +
theme_bw() + # Use the same theme
labs(title = "Model Coefficients for Employee Attrition",
x = "Variables",
y = "Estimates") + # Customizing labels
theme(plot.title = element_text(hjust = 0.5, face = "bold", color = "darkred"), # Aligning title style
axis.text.x = element_text(face = "bold", color = "darkblue"), # Styling x-axis labels
plot.margin = unit(c(1, 1, 1, 1), "cm")) # Setting margins
The logistic regression model highlights several important factors influencing employee attrition. The most significant variables are salary and years at the company, both of which show strong associations with attrition. Salary has a negative estimate (-3.633e-06, p < 0.001), indicating that higher salaries are associated with a lower likelihood of attrition; because the coefficient is expressed per single unit of salary, its odds ratio rounds to 1.00 in the model table. Similarly, each additional year at the company reduces the odds of leaving, as reflected by the negative coefficient for years_at_company (-0.633, p < 0.001), which corresponds to the odds ratio of 0.53 in the model table. These findings suggest that both competitive compensation and employee tenure play crucial roles in retaining employees.
In contrast, job satisfaction, manager rating, and work-life balance do not show statistically significant relationships with attrition. Job satisfaction has a positive estimate (0.0347, p = 0.276), and while it appears to increase the likelihood of leaving slightly, the result is not significant. Similarly, manager rating (0.005, p = 0.894) and work-life balance (0.026, p = 0.419) do not significantly influence attrition. This suggests that employees' decisions to stay or leave are driven more by tangible factors like salary and tenure than by their perceptions of satisfaction or managerial support.
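Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios. The sketch below redoes that conversion by hand with the two significant coefficients from the model summary above. The rescaling to 10,000 salary units is an illustrative assumption added here to make the salary effect readable; it is not part of the original analysis.

```r
# Coefficients copied from the model summary above
b_years  <- -6.333e-01
b_salary <- -3.633e-06

or_years      <- exp(b_years)            # odds ratio per additional year at the company
or_salary_1   <- exp(b_salary)           # odds ratio per one unit of salary
or_salary_10k <- exp(b_salary * 10000)   # odds ratio per 10,000 units of salary

round(c(years = or_years, salary_1 = or_salary_1, salary_10k = or_salary_10k), 3)
```

The value for years comes out at roughly 0.53, matching the tab_model table. The per-unit salary ratio is so close to 1 that it rounds to 1.00, while the per-10,000 ratio of roughly 0.96 shows that the effect is small per unit but not negligible over realistic pay differences.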
5.3 Analysis of compensation and turnover
Compare the average monthly income of employees who left the company (bi_attrition = 1) and those who stayed (bi_attrition = 0). Use the t.test function to conduct a t-test and determine if there is a significant difference in average monthly income between the two groups. Save the results in a variable called attrition_ttest_results.
Install the report package and use the report function to generate a report of the t-test results.
Install the ggstatsplot package and use the ggbetweenstats function to visualize the distribution of monthly income for employees who left and those who stayed. Make sure to map the bi_attrition variable to the x argument and the salary variable to the y argument.
Visualize the salary variable for employees who left and those who stayed using geom_histogram with geom_freqpoly. Make sure to facet the plot by the bi_attrition variable and apply alpha on the histogram plot.
Provide recommendations on whether revising compensation policies could be an effective retention strategy.
## compare the average monthly income of employees who left and those who stayed
# Select relevant columns
sal_and_biAtt <- hr_perf_dta %>% select(bi_attrition, salary)
attrition_ttest_results <- t.test(salary ~ bi_attrition, data = sal_and_biAtt)
## print the results of the t-test
print(attrition_ttest_results)
Welch Two Sample t-test
data: salary by bi_attrition
t = 18.869, df = 5524.2, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
38577.82 47523.18
sample estimates:
mean in group 0 mean in group 1
125007.26 81956.76
## install the report package and use the report function to generate a report of the t-test results
if (!require(report)) install.packages("report"); library(report)
attrition_ttest_report <- report(attrition_ttest_results)
print(attrition_ttest_report)
Effect sizes were labelled following Cohen's (1988) recommendations.
The Welch Two Sample t-test testing the difference of salary by bi_attrition
(mean in group 0 = 1.25e+05, mean in group 1 = 81956.76) suggests that the
effect is positive, statistically significant, and medium (difference =
43050.50, 95% CI [38577.82, 47523.18], t(5524.24) = 18.87, p < .001; Cohen's d
= 0.51, 95% CI [0.45, 0.56])
# Alternative 2 -- tidy summary of the t-test results
tidy_results <- tidy(attrition_ttest_results)
print(tidy_results)
# A tibble: 1 × 10
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 43051. 125007. 81957. 18.9 5.17e-77 5524. 38578. 47523.
# ℹ 2 more variables: method <chr>, alternative <chr>
# Alternative 3 --- tidy summary of the t-test results
tidy_results <- parameters(attrition_ttest_results)
kable(tidy_results)

| Parameter | Group | Mean_Group1 | Mean_Group2 | Difference | CI | CI_low | CI_high | t | df_error | p | Method | Alternative |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| salary | bi_attrition | 125007.3 | 81956.76 | 43050.5 | 0.95 | 38577.82 | 47523.18 | 18.8692 | 5524.236 | 0 | Welch Two Sample t-test | two.sided |
# install ggstatsplot package and use ggbetweenstats function to visualize the distribution of monthly income for employees who left and those who stayed
if (!require(ggstatsplot)) install.packages("ggstatsplot"); library(ggstatsplot)
ggbetweenstats(
data = hr_perf_dta,
x = bi_attrition,
y = salary,
xlab = "Attrition Status",
ylab = "Monthly Income",
title = "Monthly Income Distribution by Attrition Status",
ggtheme = ggplot2::theme_minimal(),
pairwise.comparisons = TRUE # To add pairwise comparisons between groups
)
# create histogram and frequency polygon of salary for employees who left and those who stayed
salary_bins <- seq(0, 600000, by = 50000)
# Get frequency values using dplyr
frequency_df <- hr_perf_dta %>%
mutate(salary_bin = cut(salary, breaks = salary_bins, right = FALSE)) %>%
group_by(salary_bin, bi_attrition) %>%
summarise(Frequency = n(), .groups = 'drop')
# View the frequency table
print(frequency_df)
# A tibble: 20 × 3
salary_bin bi_attrition Frequency
<fct> <dbl> <int>
1 [0,5e+04) 0 1031
2 [0,5e+04) 1 1111
3 [5e+04,1e+05) 0 1563
4 [5e+04,1e+05) 1 642
5 [1e+05,1.5e+05) 0 809
6 [1e+05,1.5e+05) 1 208
7 [1.5e+05,2e+05) 0 389
8 [1.5e+05,2e+05) 1 87
9 [2e+05,2.5e+05) 0 267
10 [2e+05,2.5e+05) 1 93
11 [2.5e+05,3e+05) 0 198
12 [2.5e+05,3e+05) 1 46
13 [3e+05,3.5e+05) 0 156
14 [3e+05,3.5e+05) 1 36
15 [3.5e+05,4e+05) 0 86
16 [3.5e+05,4e+05) 1 20
17 [4e+05,4.5e+05) 0 58
18 [4.5e+05,5e+05) 0 34
19 [5e+05,5.5e+05) 0 47
20 [5e+05,5.5e+05) 1 18
# A histogram of salary grouped by attrition status, with custom breaks
ggplot(hr_perf_dta, aes(x = salary, fill = factor(bi_attrition))) +
geom_histogram(alpha = 0.6, position = "identity", breaks = seq(0, 600000, by = 50000)) + # Bins every 50,000
scale_fill_manual(values = c("#00BFC4", "#F8766D"), labels = c("Stayed", "Left")) +
labs(title = "Salary Distribution for Employees Who Stayed vs. Left",
x = "Salary",
y = "Count",
fill = "Attrition Status") +
scale_x_continuous(limits = c(0, 600000),
breaks = seq(0, 600000, by = 50000),
labels = comma) + # Format x-axis labels with commas
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Slant x-axis labels
# A frequency polygon of salary grouped by attrition status
ggplot(hr_perf_dta, aes(x = salary, color = factor(bi_attrition))) +
geom_freqpoly(linewidth = 1.5, bins = 30) + # Use linewidth instead of size
scale_color_manual(values = c("#00BFC4", "#F8766D"),
labels = c("Stayed", "Left")) + # Custom colors and labels
labs(title = "Frequency Polygon of Salary for Employees Who Stayed vs. Left",
x = "Salary",
y = "Count",
color = "Attrition Status") +
scale_x_continuous(labels = comma) + # Format x-axis labels with commas
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Slant x-axis labels
The analysis reveals clear differences in salary distribution between employees who stayed with the company and those who left. Employees who departed (bi_attrition = 1) were concentrated in the lower salary bins. In the lowest bracket, [0, 50,000), 1,111 employees left compared to 1,031 who remained. This is the only bin in which leavers outnumber stayers; in every higher bin, the employees who stayed form the clear majority. This pattern indicates that higher salaries are associated with higher retention.
Statistical tests further support these findings, showing that employees who stayed earned an average salary of 125,007.3, while those who left earned 81,956.76. The mean difference of 43,050.5 with a confidence interval ranging from 38,577.82 to 47,523.18 underscores the role of salary in influencing retention.
These results suggest that the company should consider a comprehensive review of its compensation policies. Adjusting salaries for lower-income employees or revising pay structures may improve retention rates, especially among those in lower salary brackets. Additionally, targeted retention strategies such as professional development and performance bonuses could help incentivize at-risk employees to remain.
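The comparison of raw counts can be made more direct by converting each salary bin to an attrition rate. The following is a minimal sketch, not part of the original analysis: it re-enters the counts for the first four salary bins from the frequency table above, and the names salary_counts and attrition_rate are illustrative assumptions.

```r
# Counts copied by hand from the first four salary bins of the
# frequency table printed earlier (bi_attrition 0 = stayed, 1 = left).
salary_counts <- data.frame(
  salary_bin = c("[0,5e+04)", "[5e+04,1e+05)", "[1e+05,1.5e+05)", "[1.5e+05,2e+05)"),
  stayed     = c(1031, 1563, 809, 389),
  left       = c(1111, 642, 208, 87)
)

# Share of employees in each bin who left the company
salary_counts$attrition_rate <-
  salary_counts$left / (salary_counts$stayed + salary_counts$left)

print(salary_counts)
```

For these four bins, the rate falls from about 52% in the lowest bin to about 18% in the [150,000, 200,000) bin, which expresses the salary pattern as a proportion rather than a raw count.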
5.4 Employee satisfaction and performance analysis
Analyze the average performance ratings (both ManagerRating and SelfRating) of employees who left vs. those who stayed. Use the group_by and count functions to calculate the average performance ratings for each group.
Visualize the distribution of SelfRating for employees who left and those who stayed using a bar plot. Use the ggplot function to create the plot and map the SelfRating variable to the x argument and the bi_attrition variable to the fill argument.
Similarly, visualize the distribution of ManagerRating for employees who left and those who stayed using a bar plot. Make sure to map the ManagerRating variable to the x argument and the bi_attrition variable to the fill argument.
Create a boxplot of salary by job_satisfaction and bi_attrition to analyze the relationship between salary, job satisfaction, and attrition. Use the geom_boxplot function to create the plot and map the salary variable to the x argument, the job_satisfaction variable to the y argument, and the bi_attrition variable to the fill argument. You need to transform the job_satisfaction and bi_attrition variables into factors before creating the plot or within the ggplot function.
Discuss the results of the analysis and provide recommendations for HR interventions based on the findings.
# Analyze the average performance ratings (both ManagerRating and SelfRating) of employees who left vs. those who stayed.
# Remove rows with NA values in Manager or Self Ratings, and apply the mapping
rating_manager_self_filtered <- hr_perf_dta %>%
filter(!is.na(cat_manager_rating) & !is.na(cat_self_rating)) %>% # Remove NA rows
mutate( cat_manager_rating = as.numeric(factor(cat_manager_rating, levels = c("Needs improvement", "Meets expectation", "Exceeds expectation", "Above and beyond"))),
cat_self_rating = as.numeric(factor(cat_self_rating, levels = c("Needs improvement", "Meets expectation", "Exceeds expectation", "Above and beyond")))
)
# Calculate average performance ratings
average_ratings_filtered <- rating_manager_self_filtered %>%
group_by(bi_attrition) %>%
summarize(
Average_ManagerRating = mean(cat_manager_rating, na.rm = TRUE), # Calculate average Manager Rating
Average_SelfRating = mean(cat_self_rating, na.rm = TRUE), # Calculate average Self Rating
)
# View the average performance ratings
print(average_ratings_filtered)
# A tibble: 2 × 3
bi_attrition Average_ManagerRating Average_SelfRating
<dbl> <dbl> <dbl>
1 0 2.48 2.98
2 1 2.46 2.99
# Visualize the distribution of SelfRating for employees who left and those who stayed using a bar plot.
library(ggplot2)
# Create the bar chart with custom legend labels
ggplot(rating_manager_self_filtered, aes(x = factor(cat_self_rating,
levels = c(1, 2, 3, 4), # Ensuring the correct order
labels = c("Needs improvement", "Meets expectation",
"Exceeds expectation", "Above and beyond")),
fill = factor(bi_attrition))) +
geom_bar(position = "dodge", alpha = 0.7, color = "black") + # Add black outlines to the bars
scale_fill_manual(values = c("#00BFC4", "#F8766D"),
labels = c("Stayed", "Left")) + # Custom colors and labels
labs(title = "Distribution of Self Rating for Employees Who Stayed vs. Left",
x = "Self Rating",
y = "Count",
fill = "Attrition Status") +
theme_minimal()
# Visualize the distribution of ManagerRating for employees who left and those who stayed using a bar plot.
ggplot(rating_manager_self_filtered, aes(x = factor(cat_manager_rating,
levels = c(1, 2, 3, 4),
labels = c("Needs improvement", "Meets expectation",
"Exceeds expectation", "Above and beyond")),
fill = factor(bi_attrition))) +
geom_bar(position = "dodge", alpha = 0.7, color = "black") + # Add black outlines to the bars
scale_fill_manual(values = c("#00BFC4", "#F8766D"),
labels = c("Stayed", "Left")) + # Custom colors and labels
labs(title = "Distribution of Manager Rating for Employees Who Stayed vs. Left",
x = "Manager Rating",
y = "Count",
fill = "Attrition Status") +
theme_minimal()
# Create a boxplot of salary by job satisfaction and attrition status to analyze the relationship between salary, job satisfaction, and attrition.
ggplot(hr_perf_dta, aes(x = factor(cat_job_sat,
levels = c("Very dissatisfied", "Dissatisfied",
"Neutral", "Satisfied", "Very satisfied")),
y = salary,
fill = factor(bi_attrition))) +
geom_boxplot(alpha = 0.7) + # Add boxplots with some transparency
scale_fill_manual(values = c("#00BFC4", "#F8766D"),
labels = c("Stayed", "Left")) + # Custom colors and labels
labs(title = "Boxplot of Salary by Job Satisfaction and Attrition Status",
x = "Job Satisfaction",
y = "Salary",
fill = "Attrition Status") +
theme_minimal()
The analysis of average performance ratings reveals minimal differences between employees who left (bi_attrition = 1) and those who stayed (bi_attrition = 0). The average ManagerRating is approximately 2.48 for employees who stayed and 2.46 for those who left, while the average SelfRating is 2.98 for those who stayed and 2.99 for those who left. Differences of 0.01 to 0.02 points on a four-point scale are negligible, so performance ratings do not distinguish the two groups. This implies that factors other than performance ratings, such as compensation, job satisfaction, or work environment, are more likely to influence employee decisions to leave the organization.
The more visible pattern is the gap between the two rating sources rather than between the two groups. In both groups the average SelfRating (about 2.98) exceeds the average ManagerRating (about 2.47) by roughly half a point, so employees tend to rate themselves higher than their managers rate them. Because this gap is nearly identical for employees who stayed and those who left, it does not by itself explain attrition, but it may point to a disconnect between self-perception and managerial evaluation that HR could address through clearer feedback and recognition. The bar plots of SelfRating and ManagerRating should be read with the group sizes in mind, since more employees stayed than left in the dataset. Lastly, the boxplot of salary by job satisfaction suggests that higher satisfaction levels are associated with higher salaries, which is consistent with the earlier finding that competitive compensation matters for retention.
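The averages discussed above depend on how the rating labels are converted to numbers. The sketch below illustrates that conversion on a small made-up vector (toy_ratings is hypothetical and not taken from the dataset), using the same level order as the analysis code.

```r
# Level order used in the analysis: position in this vector becomes the score
rating_levels <- c("Needs improvement", "Meets expectation",
                   "Exceeds expectation", "Above and beyond")

# Hypothetical toy input, including a missing value and a label whose
# capitalisation does not match any level
toy_ratings <- c("Meets expectation", "Above and beyond",
                 "Needs improvement", NA, "Meets Expectation")

# factor() assigns integer codes by level position; as.numeric() extracts them.
# Labels that match no level (and NA) become NA.
rating_scores <- as.numeric(factor(toy_ratings, levels = rating_levels))
print(rating_scores)

# Mean of the non-missing scores, as in the summarize() step
mean_score <- mean(rating_scores, na.rm = TRUE)
print(mean_score)
```

Any label that is not spelled exactly as in the levels vector is silently turned into NA, so it may be worth checking the number of NA values after the conversion before interpreting the averages.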
At this point, you are already well aware of the dataset and the possible factors that contribute to employee attrition. Using your R skills, accomplish the following tasks:
Analyze the distribution of WorkLifeBalance ratings for employees who left versus those who stayed.
table(hr_perf_dta$cat_work_life_balance, hr_perf_dta$bi_attrition)
                         0    1
  Above and beyond     994  516
  Exceeds expectation 1146  560
  Meets expectation   1090  580
  Needs improvement   1134  568
  Unacceptable          84   37
Use visualizations to show the differences.
# Create a bar plot with diagonal x-axis labels
ggplot(hr_perf_dta, aes(x = cat_work_life_balance, fill = factor(bi_attrition))) +
  geom_bar(position = "dodge") +
  labs(x = "Work-Life Balance Rating",
       y = "Count",
       title = "Distribution of Work-Life Balance Ratings for Employees Who Left vs Stayed",
       fill = "Attrition (0 = Stayed, 1 = Left)") +
  theme_minimal() +
  scale_fill_manual(values = c("0" = "green", "1" = "red")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Assess whether employees with poor work-life balance are more likely to leave.
Based on the distribution and the plot, we cannot conclude that poor work-life balance leads to a higher likelihood of attrition. The share of employees who left is similar across all work-life balance ratings, ranging from roughly 31% (Unacceptable) to 35% (Meets expectation). A chi-square test of independence checks this formally.
work_life_balance_attrition <- table(hr_perf_dta$cat_work_life_balance, hr_perf_dta$bi_attrition)
chi_sq_result <- chisq.test(work_life_balance_attrition)
print(chi_sq_result)
Pearson's Chi-squared test
data: work_life_balance_attrition
X-squared = 2.138, df = 4, p-value = 0.7104
Null Hypothesis: There is no relationship between WorkLifeBalance and attrition (i.e., poor work-life balance is not associated with employees leaving).
Alternative Hypothesis: There is a relationship between WorkLifeBalance and attrition (i.e. poor work-life balance is associated with employees leaving).
Since the p-value of 0.7104 is well above 0.05, we fail to reject the null hypothesis: the data show no statistically significant relationship between work-life balance and attrition.
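The test can be cross-checked without the full dataset by re-entering the contingency table printed above. This is a minimal sketch under that assumption; the object names wlb_counts, attrition_share, and wlb_test are illustrative and do not appear in the original analysis.

```r
# Contingency table re-entered by hand from the table() output above
# (rows: work-life balance rating; columns: bi_attrition 0 = stayed, 1 = left)
wlb_counts <- matrix(
  c( 994, 516,
    1146, 560,
    1090, 580,
    1134, 568,
      84,  37),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    rating = c("Above and beyond", "Exceeds expectation",
               "Meets expectation", "Needs improvement", "Unacceptable"),
    bi_attrition = c("0", "1")
  )
)

# Row-wise proportions: share of employees in each rating who left
attrition_share <- prop.table(wlb_counts, margin = 1)[, "1"]
print(round(attrition_share, 3))

# Pearson chi-square test of independence on the 5 x 2 table
wlb_test <- chisq.test(wlb_counts)
print(wlb_test)
```

The row-wise shares give the quantity the task asks about directly, namely how likely employees at each rating are to leave, while the chi-square test assesses whether the variation in those shares is larger than chance alone would produce.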
You have the freedom how you will accomplish this task. Be creative and provide insights that will help HR develop effective retention strategies.
5.5 Recommendations for HR interventions
Based on the analysis conducted, provide recommendations for HR interventions that could help reduce employee attrition and improve overall employee satisfaction and performance. You may use the following questions as a guide for your recommendations and discussions.
What are the key factors contributing to attrition in the company?
The key factors contributing to employee attrition in the company are salary and years at the company. Employees who earned lower salaries were more likely to leave, while those with longer tenure were more likely to stay. In contrast, performance ratings, both self-assessments and managerial evaluations, showed minimal differences between employees who stayed and those who left, and work-life balance ratings showed no statistically significant relationship with attrition, so these factors appear to have little influence on attrition decisions. Consequently, the company should focus on competitive compensation and on fostering long-term employee commitment to reduce turnover.
Which factors are most strongly correlated with attrition?
The factors most strongly correlated with attrition are:
Years at the Company: There is a strong negative correlation between years at the company and attrition, indicating that employees with longer tenure are less likely to leave.
Salary: A negative correlation exists between salary and attrition, suggesting that employees earning lower salaries are more likely to depart from the organization.
What strategies could be implemented to improve employee retention and satisfaction?
To improve employee retention and satisfaction, organizations can implement several effective strategies that address the diverse needs of their workforce. First, conducting regular reviews and adjustments of salary structures is essential to ensure competitive compensation, particularly for lower-income employees. This not only helps in attracting talent but also plays a crucial role in retaining current staff. Additionally, providing career development opportunities through training, mentorship, and clear advancement pathways encourages professional growth and significantly increases employee engagement. Comprehensive benefits packages that cater to various employee needs, including health insurance, retirement plans, and flexible work arrangements, further enhance job satisfaction and commitment.
Creating a positive work environment is vital; fostering a supportive and inclusive culture where employees feel valued and recognized for their contributions can lead to higher morale and loyalty. Implementing employee feedback mechanisms, such as regular surveys and feedback sessions, allows organizations to gauge satisfaction levels and address concerns proactively, demonstrating a commitment to continuous improvement. Moreover, promoting work-life balance through flexible arrangements, such as remote work options and adjustable hours, helps employees manage their personal and professional lives more effectively, which can lead to increased productivity and lower turnover rates.
Recognition programs that celebrate and reward employee achievements foster a sense of appreciation and motivation, reinforcing the value of their contributions to the organization. Lastly, organizing engagement activities, such as team-building events and social gatherings, strengthens relationships among employees, enhancing overall morale and collaboration within the workplace. By adopting these strategies, organizations can create a more engaged, satisfied, and committed workforce, ultimately leading to improved retention rates and overall organizational success.
How can HR leverage the insights from the analysis to develop effective retention strategies?
HR can leverage insights from the analysis to develop effective retention strategies by focusing on a few key areas. First, they should implement targeted salary adjustments for employees in lower pay brackets, as the analysis indicates a strong correlation between salary and attrition. Additionally, enhancing career development opportunities can help engage employees and provide clear pathways for advancement, thereby increasing job satisfaction. Regularly conducting employee feedback surveys will allow HR to identify specific concerns and areas for improvement, fostering a more responsive work environment. Implementing flexible work arrangements can also address work-life balance issues, promoting higher satisfaction levels. Lastly, establishing recognition programs to celebrate employee achievements can create a more positive workplace culture, further encouraging employees to stay with the company. By focusing on these practical strategies, HR can create a more supportive and engaging environment that enhances employee retention.
What are the potential benefits of implementing these strategies for the company?
Implementing these retention strategies can yield significant benefits for the company. First, improving salary structures and offering competitive compensation can enhance employee satisfaction and loyalty, leading to reduced turnover rates and lower recruitment costs. Enhanced career development opportunities can foster a culture of growth, resulting in a more skilled and motivated workforce, which directly impacts productivity and innovation. Regular feedback mechanisms can help the company identify and address issues proactively, creating a more engaged workforce and minimizing conflicts. By promoting work-life balance through flexible arrangements, the company can enhance employee well-being, leading to increased morale and lower absenteeism. Additionally, recognition programs can boost employee motivation and reinforce a positive organizational culture, enhancing overall job satisfaction. Collectively, these strategies can contribute to improved performance, higher employee retention, and ultimately, a stronger bottom line for the organization.